feat(e2e): agentic verification loop with MCP Playwright browser layer #128
Draft
informatico-madrid wants to merge 399 commits into tzachbon:main
…ore and FORK_GOALS
- Rename selector-map.skill.md → homeassistant-selector-map.skill.md (HA-specific examples, reusable as reference for other projects)
- Add ui-map-init.skill.md: agnostic protocol for generating ui-map.local.md in any project (runs once per project, output is gitignored)
- Add **/ui-map.local.md to .gitignore
- Update FORK_GOALS.md Phase 0.1: remove coupled selector hierarchy from description, reference new filenames, stay generic
…meassistant-selector-map.skill.md)
- .gitignore: add **/ui-map.local.md (project-specific selector map, never committed)
- FORK_GOALS.md Phase 0.1: update filenames, remove coupled selector hierarchy from description, stay domain-agnostic
…uct-manager + exploratory qa-engineer
…-playwright skills
- Add playwright-env.skill.md: resolves appUrl, authMode, credentials refs, seed data, browser config and safety limits before any browser tool call. Emits ESCALATE if critical context is missing instead of guessing.
- Add playwright-env.local.md.example: full template covering all auth modes (none, form, token, cookie, basic, oauth, storage-state). Gitignored by design.
- Update mcp-playwright.skill.md (v1→v2): add Step -1 that loads playwright-env before the dependency check. Add anti-pattern entry for skipping Step -1.
- Update playwright-session.skill.md (v1→v2): reads appUrl/authMode/browser config from .ralph-state.json→playwrightEnv instead of hardcoded values. Add explicit auth flow per mode (form, token, cookie, basic, storage-state, oauth/sso). Add ESCALATE rule for oauth/sso without pre-auth session.
README.fork.md already references playwright-env.skill.md and playwright-env.local.md.example; both now exist.
- .gitignore: add playwright-env.local.md and .ralph-auth-state.json to prevent accidental commit of local auth config and storage-state files.
- playwright-env.skill.md (v1→v2): add explicit connectivity check step after appUrl is resolved. Uses curl with 5s timeout before writing playwrightEnv to state. Emits ESCALATE(app-not-reachable) if check fails. Checklist item updated from vague 'reachable' note to concrete step ref.
…w-5 seed order, yellow-6 stable-state timeout, white-7 jq race note
- playwright-session.skill.md (v2→v3):
- orange-3: expand token auth section with 3 concrete injection patterns
(localStorage, Authorization header, cookie fallback) so agent never
improvises. Add tokenBootstrapRule field to playwright-env.local.md format.
- yellow-5: clarify seedCommand execution order — runs in playwright-env
AFTER connectivity check, BEFORE writing playwrightEnv to state.
- yellow-6: replace vague 'stable state' with concrete criterion: call
browser_snapshot and check no [aria-busy] or loading indicators in tree;
retry once after 1000ms if found. Document the tool to use.
- ui-map-init.skill.md (v1→v2):
- orange-4: add Step -1 (load playwright-env) before Step 0, matching
the pattern in mcp-playwright.skill.md. Stops if ESCALATE received.
- playwright-env.skill.md (v2→v3):
- yellow-5: add seedCommand execution step between connectivity check
and writing playwrightEnv to state. Only runs on local/staging.
- mcp-playwright.skill.md (v2→v3):
- white-7: add note in Step 0 about jq write race condition in parallel
execution scenarios. Recommend basePath-scoped lock or sequential VE tasks.
Critical fixes:
- playwright-session: fix token Pattern A (localStorage inject before goto fails on blank origin)
- playwright-env: wrap RALPH_SEED_COMMAND in eval + quotes to handle args/spaces
- mcp-playwright: replace npx @latest (downloads) with --no-install check + lock recovery
Cache & isolation:
- mcp-playwright: add RALPH_PLAYWRIGHT_ISOLATED setting, --isolated flag, lock recovery protocol
- playwright-session: add Session End lock-file cleanup on close failure
- playwright-env: add RALPH_PLAYWRIGHT_ISOLATED to Settings Reference + tokenLocalStorageKey
Minor:
- mcp-playwright: clarify snapshot-only vs screenshot-always contradiction
- playwright-session: add CAPTCHA/2FA ESCALATE detection in form auth
- playwright-env: add TTL warning for state fallback from previous session
…on, mcp context reset, auth table header
- ui-map-init: fix date -r (Linux only) → portable stat fallback for macOS
- mcp-playwright: Step 0b, run always when isolated=false, not conditionally
- playwright-session: clarify MCP server restart required between sessions
- playwright-env: fix broken markdown table header in Authentication section
…sk numbering contract, spec-workflow e2e bridge
- reality-verification: add project-type detection (api-only/cli/library skip Playwright entirely)
- reality-verification: document VE task numbering contract (VE0 = ui-map-init, 4.3 = verify fix)
- spec-workflow: reference e2e skills in implement phase, add project-type to requirements phase activities
… without issue ref v0.4.0
mcp-playwright MUST load before playwright-session: session start reads .ralph-state.json → mcpPlaywright, which is only written by mcp-playwright Step 0. Loading playwright-session first causes it to find the key absent and fall into degraded mode incorrectly.
Correct order (matches spec-workflow/SKILL.md and phase-transitions.md):
1. playwright-env
2. mcp-playwright
3. playwright-session
4. ui-map-init (VE0 only)
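The load-order rule in this commit can be checked mechanically. The following is only a sketch: `REQUIRED_ORDER` and `load_order_violations` are hypothetical names introduced for illustration, not part of the plugin, and the skill names are taken from the commit message above.

```python
# Hypothetical helper (not part of the plugin): validates the skill load order
# stated in the commit message. Skills that are not loaded are simply ignored.
REQUIRED_ORDER = [
    "playwright-env",
    "mcp-playwright",
    "playwright-session",
    "ui-map-init",  # VE0 only
]


def load_order_violations(loaded):
    """Return (earlier, later) pairs that were loaded in the wrong relative order."""
    position = {name: index for index, name in enumerate(loaded)}
    violations = []
    for i, earlier in enumerate(REQUIRED_ORDER):
        for later in REQUIRED_ORDER[i + 1:]:
            both_loaded = earlier in position and later in position
            if both_loaded and position[earlier] > position[later]:
                violations.append((earlier, later))
    return violations
```

An empty result means the order is acceptable; loading playwright-session before mcp-playwright yields that pair as a violation.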
…fig, not agent commands
The skill was documenting npx launch commands as if the agent executes them directly. In Claude MCP context the agent never launches the server; the human configures it in their MCP client (claude_desktop_config.json or equivalent). The agent only calls browser_* tools that the already-running server exposes.
- Remove Protocol A section implying agent runs npx to start server
- Add human-config note: --isolated and --caps=testing must be set in the MCP server definition, not invoked by the agent at runtime
- Preserve isolation mode documentation as context for the agent (so it knows what behaviour to expect from the server)
…/npx)
The MCP server is a long-running process managed by the human in their MCP client config. The agent never launches or kills it.
- Remove Steps 4 & 5 from Session Start (pkill + npx launch)
- Remove "Stop the MCP server process" from Session End
- Remove pkill from Session End abnormal-termination block
- Context Isolation: replace "Restart MCP server" with ESCALATE note
- Cleanup Checklist: remove server-process check; agent responsibility ends at browser_close + lock recovery
Consistent with mcp-playwright.skill.md v6 (human-config note).
…xt myth, resolvedAt timestamp, ui-map session clarity, spec-executor session end
Issues fixed:
- #2 🔴 playwright-session Start Steps 4/5/6: auth sequence broken for cookie/storage-state. cookie and storage-state inject BEFORE any navigation; form goes to loginUrl not appUrl; token (Pattern A) navigates to appUrl then injects. Replaced unconditional Step 4 with authMode-conditional branching table.
- #1 🟠 playwright-session Step 4: removed false claim 'MCP server creates a fresh context on each navigation'. Added explicit note: browser_navigate does NOT reset state within a session. Isolation comes from --isolated flag at server startup, not per-navigation.
- #3 🟡 playwright-env Write State: added resolvedAt ISO timestamp so staleness warning in Resolution Order point 3 is actually computable by the agent.
- #4 🟠 ui-map-init: added explicit note that VE0 opens its own browser session (does not reuse any prior session) and must follow Session End from playwright-session when done.
- #6 🟠 spec-executor: added explicit Session End reminder after VE task completion so sessions are not leaked between consecutive VE tasks.
…row for multiple VE tasks spec-executor is the authority on session policy between VE tasks. The "Multiple VE tasks in same spec: Same context OK" row contradicted the mandatory Session End rule in spec-executor. Removed to eliminate the ambiguity. Added a clarifying note that "during" scope is sub-steps within a single VE task, not across VE tasks.
- Replaced vitest (Node.js) runner with bats (Bash test framework)
- Updated Test Discovery section with bats commands
- Updated Mock Boundary table with bash-appropriate mocks
- Updated Concurrent writes row with background subshells pattern
- Verified: grep vitest returns CLEAN
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- design.md: added bash to ACK message block, text to CLOSE Thread Example
- requirements.md: added text to FR-2 message format example
- Note: lines 289 and 306 had no bare fences in current file state
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ence detection, human as participant Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ents Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…e bug, inconsistencies, reviewer improvements Co-Authored-By: Claude Opus 4.6 <[email protected]>
…-executor, format fixes
- C1: spec-executor.md: add flock-based atomic write (was bare cat >>)
- C2: requirements.md: fix NFR-1 vs FR-13 contradiction (clarify flock for chat.md, temp+rename for state)
- M3/M4: BLOCK→HOLD in spec-executor.md and external-reviewer.md (5+ references)
- M5: chat.md: real entries now follow canonical ### [writer → addressee] format
- M8: design.md: remove language tag from closing backticks
- N1: templates/chat.md: add text language to fenced block
- N5: chat.md (spec instance): add text language to fenced block
- N8: .progress.md: mark temp+rename note as superseded
- N11: tasks.md: rename lastReadIndex→lastReadLine globally
- N13: task_review.md: remove duplicate timestamp in resolved_at
Co-authored-by: Qwen-Coder <[email protected]>
… progress.md typo
- N2: index-state.json: 'complete' → 'completed' for ralph-quality-improvements
- N3: index.md: empty status → 'done' for ralph-quality-improvements
- M6: research.md: clarify two activation thresholds (existence vs existence + 1 message)
- N7: .progress.md: 'docs/agen-chat/' → 'docs/agent-chat/'
Co-authored-by: Qwen-Coder <[email protected]>
Feat/agent chat protocol
Add mandatory chat.md check step between task parsing and delegation. The coordinator now reads new messages from chat.md (after lastReadLine), blocks on HOLD/PENDING, responds to OVER, and announces each task before delegating — activating the bidirectional protocol already defined in spec-executor.md and external-reviewer.md.
The Chat Protocol section already defined the correct bidirectional behavior but was disconnected from the numbered Task Loop. Add step 2a so the executor reads chat.md on every iteration BEFORE checking task_review.md (step 2b), mirroring the pilot callout pattern now enforced in coordinator-pattern.md.
Expand Step 4 to cover all signals defined in external-reviewer.md:
- INTENT-FAIL: log + wait 1 cycle before delegating
- DEADLOCK: hard stop, surface to human
- URGENT: treat as HOLD
- CONTINUE: no-op, proceed
- ALIVE / STILL: heartbeats, ignore
Previous commit only handled HOLD, PENDING and OVER.
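The signal handling listed in this commit and the previous coordinator commit amounts to a lookup from signal to coordinator action. A minimal sketch follows; the `Action` enum, the `decide` function, the priority ordering when several signals arrive together, and the treatment of unknown signals as no-ops are all assumptions made for illustration, not documented behavior.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    RESPOND = "respond"
    WAIT_ONE_CYCLE = "wait_one_cycle"
    BLOCK = "block"
    STOP_AND_ESCALATE = "stop_and_escalate"


# Mapping taken from the commit messages; the action names are hypothetical.
SIGNAL_ACTIONS = {
    "HOLD": Action.BLOCK,
    "PENDING": Action.BLOCK,
    "URGENT": Action.BLOCK,               # treated as HOLD
    "OVER": Action.RESPOND,
    "INTENT-FAIL": Action.WAIT_ONE_CYCLE,  # log + wait 1 cycle
    "DEADLOCK": Action.STOP_AND_ESCALATE,  # hard stop, surface to human
    "CONTINUE": Action.PROCEED,            # no-op
    "ALIVE": Action.PROCEED,               # heartbeat, ignored
    "STILL": Action.PROCEED,               # heartbeat, ignored
}

# Assumed priority when several unread signals are present (most severe first).
PRIORITY = [
    Action.STOP_AND_ESCALATE,
    Action.BLOCK,
    Action.WAIT_ONE_CYCLE,
    Action.RESPOND,
    Action.PROCEED,
]


def decide(signals):
    """Pick the most severe action among the new signals read from chat.md."""
    # Assumption: unknown signals are treated as no-ops.
    actions = {SIGNAL_ACTIONS.get(signal, Action.PROCEED) for signal in signals}
    for action in PRIORITY:
        if action in actions:
            return action
    return Action.PROCEED
```

Keeping the table as data rather than branching logic makes it easy to compare against the signal list in external-reviewer.md when new signals are added.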
… layer, EXECUTOR_START as Layer 0
Changes from V2 (agent self-fix):
- Role Definition: add 3 lines making chat.md usage explicit in Integrity Rules
- Chat Protocol: add Step 2b, read task_review.md before delegating (defense-in-depth)
- Verification: add Layer 2b anti-fabrication (independent verify command execution)
- Update all 'all 3 layers' references to 'all 4 layers'
Changes from V3 (external analysis):
- Reposition Step 2b as standalone section BEFORE the Chat Protocol <mandatory> block
- Add defense-in-depth comment explaining intentional duplication vs spec-executor
- Integrate EXECUTOR_START verification as Layer 0 (was floating section, now numbered blocker)
- Update Verification Summary to list all 5 layers (0+4)
…s.md unmark
- G4, Section 0 Bootstrap: add step to read chat.md and check for active HOLD/PENDING/DEADLOCK signals before starting the Review Cycle. Prevents reviewer from starting blind when a conversation is already in progress.
- G2, Section 6b unmark: wrap the tasks.md unmark write in a flock exclusive lock (same pattern as chat.md atomic append) to prevent a race condition with the coordinator reading tasks.md to advance taskIndex concurrently.
Add TASK_AMBIGUOUS pre-execution signal (inspired by ChatDev dehallucination communicative pattern) to break the retry loop caused by ambiguous task blocks:
spec-executor.md:
- New section 'Ambiguity Detection (Pre-Execution)' before Task Loop
- Criteria for emitting TASK_AMBIGUOUS: contradictory instructions, missing required context, impossible constraints, undefined references
- Output format mirrors TASK_MODIFICATION_REQUEST (structured, parseable)
- Guard: max 1 TASK_AMBIGUOUS per task to prevent clarification loops
coordinator-pattern.md:
- New handler in 'After Delegation' for TASK_AMBIGUOUS output
- Coordinator enriches task block with clarification and re-delegates
- Does NOT increment taskIteration (ambiguity is spec error, not execution error)
- Logs clarification applied to .progress.md for auditability
- Max 2 clarification rounds per task before escalating to human
channel-map.md (new):
- Documents all shared filesystem channels, writers, readers, timing
- Explicitly marks channels with multiple writers (race condition risk)
- Explains locking strategy per channel
- Reference document for future protocol decisions and new agent onboarding
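The coordinator-side handling of TASK_AMBIGUOUS described above could look roughly like the sketch below. The state keys `clarificationRounds` and `progressLog` and the return values are hypothetical, chosen only for illustration; the limit of 2 rounds and the untouched `taskIteration` come from the commit message.

```python
MAX_CLARIFICATION_ROUNDS = 2  # "Max 2 clarification rounds per task" per the commit message


def handle_task_ambiguous(state, clarification):
    """Decide what the coordinator does when the executor emits TASK_AMBIGUOUS.

    `state` is a plain dict standing in for the per-task state; the keys
    'clarificationRounds' and 'progressLog' are hypothetical.
    """
    rounds = state.get("clarificationRounds", 0)
    if rounds >= MAX_CLARIFICATION_ROUNDS:
        return "escalate_to_human"
    state["clarificationRounds"] = rounds + 1
    # Stand-in for logging the applied clarification to .progress.md.
    state.setdefault("progressLog", []).append(f"clarification applied: {clarification}")
    # taskIteration is deliberately left untouched: ambiguity is a spec error,
    # not an execution error.
    return "re_delegate"
```

Counting clarification rounds separately from `taskIteration` keeps a badly written task from consuming the retry budget meant for execution failures.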
…lls for VE tasks
- Added detailed E2E / VE task review section in external-reviewer.md to enforce proper review protocols.
- Updated spec-reviewer.md to include a new E2E review rubric for better evaluation of VE tasks.
- Mandated inclusion of `Skills:` metadata in VE tasks within task-planner.md to ensure necessary skills are loaded.
- Revised coordinator-pattern.md to enforce strict anti-patterns for VE tasks and required skills.
- Enhanced phase-rules.md with specific requirements for VE2 tasks to ensure comprehensive user flow verification.
- Updated verification-layers.md to differentiate artifact type instructions for VE/E2E tasks.
- Introduced a new SKILL.md for E2E to outline the skill suite for end-to-end testing.
- Expanded homeassistant-selector-map.skill.md with native navigation documentation for Home Assistant.
- Enhanced playwright-session.skill.md with a mandatory unexpected page recovery protocol.
- Updated tasks.md template to require skills for all VE tasks and clarified task structures.
…s and improve clarity in task templates
Fixes applied (7 real issues from 18 comments):
- tasks.md: add playwright-session to VE0 Skills (both POC/TDD blocks)
- tasks.md: add selector-map to VE2 Skills (both POC/TDD blocks)
- task-planner.md: align Skills example to **Skills**: bold format + fix dup </mandatory>
- external-reviewer.md: resolve tool-permissions contradiction (allow post-task test exec)
- homeassistant-selector-map.skill.md: fix false 'English across locales' claim, use data-panel-id for sidebar items instead of localized getByRole names
- spec-reviewer.md: simplify 'No fixed waits' FAIL to match absolute PASS prohibition
- coordinator-pattern.md: update 3 'Required Skills' refs to 'Skills:' metadata field
False positives (not changed):
- VE3 platform_skills: cleanup task doesn't need platform-specific skills
- artifactType header in verification-layers: style preference, inline approach works
- || table syntax in playwright-session: valid Markdown single-pipe tables
- [VERIFY] in submode detection: layered design already handles this correctly
- MD040 fence warnings: agent instruction files, not user-facing rendered docs
…-chat-coordinator feat(e2e): enhance verification processes and introduce mandatory ski…
Update fork with latest changes from tzachbon/smart-ralph
…er signals in QA process
…detection and handling in agents
…entes external-reviewer and qa-engineer
Improve flat flow
Summary
This PR adds an agentic verification loop to Smart Ralph: the agent reads a user story, reasons about what "working" looks like, explores the system creatively, and self-corrects when something breaks — without scripted steps or Gherkin.
All changes are additive and live inside plugins/ralph-specum/. Zero changes to core flow, existing commands, or stop-watcher behavior.
Problem
Classical test coverage strategies were designed for humans writing tests for humans to read. When an agent writes and executes those same tests:
The root issue: classical testing is a human artifact, not a verification mechanism designed for agentic loops.
The Idea
We call this the Verification Contract: a lightweight section in each requirements.md that tells the agent what to observe, not how to test.
What's Added
1. Verification Contract in specs (templates/requirements.md, agents/product-manager.md)
Each spec's requirements.md gets a ## Verification Contract section.
product-manager is updated with guidelines to populate this from user stories.
2. Exploratory qa-engineer (agents/qa-engineer.md)
New [STORY-VERIFY] mode: given a user story + Verification Contract, the agent derives and executes checks autonomously. No Gherkin. No scripted steps. Emits structured signals: VERIFICATION_PASS, VERIFICATION_FAIL, FINDING.
Example checks derived from a single story ("filter invoices by date"):
None of these come from a script. They come from reasoning about intent.
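To make the structured signals concrete, here is a minimal sketch of how a consumer might extract them from qa-engineer output. The line format (signal name at the start of a line, optionally followed by a colon and a detail) is an assumption, since the PR text does not specify it; `parse_signals` and `story_verdict` are hypothetical names.

```python
import re

# Assumed format: "<SIGNAL>" or "<SIGNAL>: <detail>" at the start of a line.
SIGNAL_RE = re.compile(r"^(VERIFICATION_PASS|VERIFICATION_FAIL|FINDING)\b[:\s]*(.*)$")


def parse_signals(output):
    """Return (signal, detail) pairs found in the agent's text output."""
    signals = []
    for line in output.splitlines():
        match = SIGNAL_RE.match(line.strip())
        if match:
            signals.append((match.group(1), match.group(2).strip()))
    return signals


def story_verdict(signals):
    """Collapse parsed signals into a verdict; any failure wins over a pass."""
    names = [name for name, _ in signals]
    if "VERIFICATION_FAIL" in names:
        return "fail"
    if "VERIFICATION_PASS" in names:
        return "pass"
    return "unknown"
```

A FINDING on its own does not change the verdict in this sketch; whether findings should block a story is left to the actual protocol.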
3. Repair loop (hooks/scripts/stop-watcher.sh)
When VERIFICATION_FAIL is detected:
New .ralph-state.json fields: repairIteration, failedStory, originTaskIndex.
4. Regression sweep (hooks/scripts/stop-watcher.sh)
After ALL_TASKS_COMPLETE, reads **Dependency map** from the completed spec and runs targeted [STORY-VERIFY] sweeps on dependent specs only. Three tiers: Local → Invariants → Full (nightly).
5. MCP Playwright browser layer (plugins/ralph-specum/skills/e2e/)
Four skill files covering the full browser verification protocol:
- playwright-env.skill.md
- mcp-playwright.skill.md: @playwright/mcp protocol: tool selection, verification sequence, devtools tracing, PASS/FAIL/DEGRADED/ESCALATE signals
- playwright-session.skill.md
- ui-map-init.skill.md
task-planner auto-injects a VE0 task (ui-map-init prerequisite) before any spec that uses Playwright. Zero human memory required.
Key design decisions:
- … --isolated --caps=testing).
- @playwright/mcp is missing → VERIFICATION_DEGRADED + escalate to human.
- ui-map.local.md is gitignored. Local to each developer, grown incrementally by the agents that use it.
Files Changed
New files:
- plugins/ralph-specum/skills/e2e/playwright-env.skill.md
- plugins/ralph-specum/skills/e2e/mcp-playwright.skill.md
- plugins/ralph-specum/skills/e2e/playwright-session.skill.md
- plugins/ralph-specum/skills/e2e/ui-map-init.skill.md
- plugins/ralph-specum/skills/e2e/e2e-verify-integration.skill.md
- plugins/ralph-specum/skills/e2e/homeassistant-selector-map.skill.md
- FORK_GOALS.md
Modified files (additive only):
- templates/requirements.md: added ## Verification Contract section
- agents/product-manager.md: guidelines to populate Verification Contract
- agents/qa-engineer.md: [STORY-VERIFY] mode
- agents/task-planner.md: VE task auto-injection, VE0 prerequisite
- agents/spec-executor.md: skill load order for VE tasks, session end reminders, ui-map patch after data-testid changes
- agents/architect-reviewer.md: mandatory test strategy with mock rules
- plugins/ralph-specum/hooks/scripts/stop-watcher.sh: repair loop + regression sweep
- .gitignore: **/ui-map.local.md
What This Is NOT
- … qa-engineer (extends it)
Summary by CodeRabbit
New Features
Documentation